Serious SQL Health Analytics
Photo credit: Danny Ma @ Data with Danny
Introducing Serious SQL by Danny Ma - A comprehensive SQL learning experience.
Link to GitHub repository, including data and final output.
About the data set
Health data measured by users using their wearables and enter into a tracking device. Columns in the data table "user_logs" in the "health" schema are:
Photo credit: Danny Ma @ Data with Danny
Practice questions
- How many unique users exist in the logs dataset?
Answer: 554select count (distinct (id)) as unique_id from health.user_logs;
- How many duplicated records are there?
Answer: 19,449with cte as ( select id, log_date, measure, measure_value, systolic, diastolic, count(*) as count_records from health.user_logs group by id, log_date, measure, measure_value, systolic, diastolic ) select sum(count_records) from cte where count_records > 1;
- How many unquie records are there?
Answer: 24,442with cte as ( select id, log_date, measure, measure_value, systolic, diastolic, count(*) as count_records from health.user_logs group by id, log_date, measure, measure_value, systolic, diastolic ) select sum(count_records) from cte where count_records = 1;
- Who are the users with the most duplicated records?
Answer: 054250c692e07a9fa9e62e345231df4b54ff435d with 17279 duplicated recordswith cte as ( select id, log_date, measure, measure_value, systolic, diastolic, count(*) as count_records from health.user_logs group by id, log_date, measure, measure_value, systolic, diastolic ) select id, sum(count_records) as sum_records from cte where count_records > 1 group by id order by sum_records desc;
- How many duplicated records in each measure category?
Answer: blood_glucose = 18,471; weight = 677; blood_pressure = 301with cte as ( select id, log_date, measure, measure_value, systolic, diastolic, count(*) as count_records from health.user_logs group by id, log_date, measure, measure_value, systolic, diastolic ) select measure, sum(count_records) as sum_records from cte where count_records > 1 group by measure order by sum_records desc;
- Ignore duplications, for weight, what are the minimum and maximum records?
Answer: 39,642,120 and 0select max(measure_value) as max_rec, min(measure_value) as min_rec from health.user_logs where measure = 'weight';
- Ignore duplications, allocate weights in to 100 percentitles and find outliers.
Answer: ouliers are out side the range of 1.8 to 201select measure_value, ntile(100) over( order by measure_value ) as percentile from health.user_logs where measure = 'weight';
- Ignore duplications and exclude outliers, what are the avg, median, mode, stde of weight measures?
Answer: mean_value/median_value/mode_value/standard_deviation = 80.76/75.98/68.49/26.91select round( avg(measure_value), 2 ) as mean_value, round( cast( percentile_cont(0.5) within group ( oder by measure_value ) as numeric ), 2 ) as median_value, round( mode() within group ( order by measure_value ), 2 ) as mode_value, round( stddev(measure_value), 2 ) as standard_deviation from health.user_logs where measure = 'weight' and (measure_value between 1.8 and 201);
- How many total measurements do we have per user on average?
Answer: 79.23with cte as( select id, count (*) as frequency from health.user_logs group by id ) select round( sum (frequency)::numeric / count (distinct id), 2 ) as record_per_customer from cte;
- What about the median number of measurements per user?
Answer: 2with cte as( select id, count (*) as frequency from health.user_logs group by id ) select round(percentile_cont(0.5) within group ( order by frequency) ::numeric, 2) as median from cte;
- How many users have 3 or more measurements?
Answer: 209with cte as( select id, count (*) as frequency from health.user_logs group by id ) select count (id) from cte where frequency >= 3;
- How many users have 1,000 or more measurements?
Answer: 5with cte as( select id, count (*) as frequency from health.user_logs group by id ) select count (id) from cte where frequency >= 1000;
- How many users have logged blood glucose measurements?
Answer: 325with cte as( select id, measure, count (*) as frequency from health.user_logs group by id, measure ) select count (id) from cte where measure = 'blood_glucose';
- How many user have at least 2 types of measurements?
Answer: 204with cte as( select id, count (distinct measure) as frequency from health.user_logs group by id ) select count (id) from cte where frequency >=2;
- How many users have all 3 measures - blood glucose, weight and blood pressure?
Answer: 50with cte as( select id, count (distinct measure) as frequency from health.user_logs group by id ) select count (id) from cte where frequency =3;
- For users that have blood pressure measurements, what is the median systolic/diastolic blood pressure values?
Answer: 129/79select percentile_cont(0.5) within group ( order by systolic) ::numeric as median_systolic, percentile_cont(0.5) within group ( order by diastolic) ::numeric as median_diastolic from health.user_logs where (measure = 'blood_pressure') and (systolic is not null) and (diastolic is not null);